A comparative analysis of factors influencing colorectal cancer’s age standardized mortality ratio among Korean women in the hot and cold spots

This study explored the factors that most influence the age-standardized mortality ratio (ASMR) of colorectal cancer (CRC) among Korean women, drawing on factors reported in previous studies. Data on these factors for 250 municipalities were taken from the Korean Statistical Information Service (KOSIS) for 2010 to 2018. In the exploratory survey, women aged 65 and above accounted for over 70% of female colorectal cancer deaths. After investigating the existing literature and theories, the 250 regions were classified into hot and cold spots according to ASMR. The Nearest Neighbor Index (NNI), Moran's I index, and the Durbin-Watson test were also used. The regional cluster analysis of ASMR showed that the hot spots were in inland areas and the cold spots were in southwestern coastal areas. The results also showed differences in residents' lifestyles between the two regions, as well as mean differences between them. There was no significant difference between the two regions in ASMR for breast cancer, CRC deaths, or agricultural product shipments. In the multiple regression model, CRC mortality, diabetes, and the CRC age-standardized incidence ratio (ASIR) emerged as the major influencing factors, and the model was significant with an adjusted R-squared of 30.6%. However, smoking, alcohol consumption, abdominal obesity, breast cancer, and food consumption had less influence on the occurrence of CRC. The aging rate, amount of food consumption, seafood production, livestock product shipments, and drinking rate were higher in the cold spots than in the hot spots.


Introduction
Colorectal cancer (CRC) is a global health challenge, and the search for a cure continues to occupy patients, doctors, medical device programmers, and other scientists working on this health concern. In 2018, the National Cancer Information Center (NCIC) in South Korea reported CRC as the third

Materials and methods
The researchers reviewed the related literature and investigated the factors and major theories of CRC. An exploratory study was then conducted to confirm the age distribution of female CRC deaths. For the dependent and explanatory variables, nine years of statistical data (2010 to 2018) from the Korean Statistical Information Service (KOSIS) were used.

Review process
From the KOSIS data, the research focuses on the quantity of food consumption (fish, meat, vegetables) and on alcohol and cigarette consumption as hypothetical causes of mortality in the target hot spots (Jangan-gu, Gwonseon-gu, Paldal-gu, and Yeongtong-gu in Suwon-si; Sangdang-gu, Seowon-gu, and Cheongwon-gu in Cheongju-si; Gyeryong-si; Boeun-gun; and Goesan-gun) and cold spots (Jeongeup-si, Sunchang-gun, Seongdong-gu, Ongjin-gun, Jangseong-gun, Wando-gun, Sacheon-si, Sancheong-gun, and Hwasun-gun) (KOSIS, 2010-2018). However, KOSIS did not provide eco-friendly certified agricultural product shipments or eco-certified livestock data for 2010-2013, and it does not provide individual data for the districts within a city, so city-level data are distributed to every district; for example, the city of Cheongju has four districts: Heungdeok-gu, Seowon-gu, Sangdang-gu, and Cheongwon-gu. KOSIS began providing complete data in 2014. In 2014, Cheongwon-gun and the nearby areas were merged into Cheongju city; the area names changed and the land area of the city grew, so the KOSIS data from when the area was still Cheongwon-gun differ slightly from the more recent data for the city. This preprocessing was done in Microsoft Office 365 Excel. The spatial data frame for the dependent variable was constructed in QGIS 3.8. This geographic information system (GIS) program was also used to visually check the statistical analysis results for errors. To confirm the approximate concentration of the female colorectal cancer ASMR, the ASMRs of 250 cities and towns across South Korea were visualized by classifying them into two steps.

Data analysis
The dependent variable is the mean of the CRC ASMR over the nine years from 2010 to 2018, and the independent (explanatory) variables are year, city, gender, and cause of death. The spatial weights for the dependent variable are calculated from the adjacency of the borders between regions. The presence or absence of spatial autocorrelation is determined by calculating the global Moran's I value. Local cluster patterns of the dependent variable are identified as hot spot and cold spot areas using Getis-Ord Gi*. These calculations were run in R version 3.3.1 and RStudio version 1.1.463 on a 64-bit Windows 10 Pro platform.
To find the average difference between the two cluster regions, the mean of each explanatory variable, the equality of variances (F-test), and the significance probability (t-test) were calculated. A regression model is used to support the statistical correlation of the CRC death factors. Alongside the differences between the two regions, the explanatory variables are tested before this regression model is analyzed. First, a multicollinearity test is conducted to examine linear correlation among the explanatory variables. Second, three tests are used to check the homogeneity of variance across the two cluster regions. Third, the Durbin-Watson test is performed to diagnose autocorrelation in the model. After these tests, the factors of the regression model are analyzed and the results are interpreted.

Exploratory survey
The number of deaths by 5-year age group is explored to determine more precisely the age group in which female colorectal cancer mortality is concentrated [19]. As shown in Tables 1 and 2, women aged 65 years or older accounted for more than 70% of all deaths from colon, rectal, and anal cancers.

Descriptive statistics analysis for cluster analysis
The observations in this study are organized by region into 250 municipalities across the country. Their mean is 8.15 and their standard deviation is 1.35.
Step classification. To visually examine the regional pattern of the age-standardized mortality ratio in female CRC, the 250 regions are divided into two steps. The ASMR of female colorectal cancer was higher in the central region than in the southern region. In addition, to identify patterns quantitatively, the distance between the 250 administrative districts of the country and the center point of the county districts was calculated using the NNI of Eqs (3)-(5).
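The nearest neighbor calculation described here can be illustrated with a minimal Python sketch. It assumes planar point coordinates (for example, district center points) and a known study area. The function name, the sample points, and the constant 0.26136 in the standard error (the usual Clark-Evans form, which this paper does not state) are assumptions for illustration and are not the authors' R code.

```python
import math


def nearest_neighbor_index(points, area):
    """Return (R, z) for a list of (x, y) points inside a region of given area."""
    n = len(points)
    # Observed mean nearest-neighbor distance d_o.
    nearest = []
    for i, (xi, yi) in enumerate(points):
        nearest.append(
            min(
                math.hypot(xi - xj, yi - yj)
                for j, (xj, yj) in enumerate(points)
                if j != i
            )
        )
    d_o = sum(nearest) / n
    # p: number of points divided by area.
    density = n / area
    # Expected mean nearest-neighbor distance d_E for a random pattern.
    d_e = 1.0 / (2.0 * math.sqrt(density))
    # Standard error of d_o (assumed Clark-Evans constant).
    se = 0.26136 / math.sqrt(n * density)
    return d_o / d_e, (d_o - d_e) / se


# Hypothetical example: four points on the corners of a unit square,
# placed inside a study area of 4 square units.
r, z = nearest_neighbor_index([(0, 0), (1, 0), (0, 1), (1, 1)], area=4.0)
print(round(r, 3), round(z, 3))
```

An index above 1 means the points are farther apart than expected under randomness, and an index below 1 means they are closer together; the z value indicates whether that departure is statistically significant.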
The NNI of the regions was 1.15, the z value was 4.62, and the p value was <0.001. Since the result was significant at the 95% confidence level, the distribution was found to have a random pattern.

Flow chart for study method. Schematic diagram of the research method: after investigating the existing literature and theories, 250 regions were classified into hot and cold spots according to age standardized mortality ratio (ASMR), and the mean differences in causes were analyzed. The causes of these spots are also analyzed and interpreted. https://doi.org/10.1371/journal.pone.0273995.g001

Diagnosis of spatial autocorrelation of dependent variables
The global Moran's I value was calculated using Eq (6) to confirm whether the female CRC age-standardized mortality ratio clusters in municipalities across the country and to quantify spatial distribution patterns. The weights for Moran's I were calculated using the Queen method, which determines whether the borders between regions are adjacent. With this contiguity weight matrix, the Moran's I value of the dependent variable is 0.117 (standard deviate: 1.77, p = 0.038, expectation: -0.0051, variance: 0.00472). Since this value is close to 0, the spatial autocorrelation is weakly positive (+) and close to a random spatial pattern, and it is significant at the 5% significance level.
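The reported Moran's I statistics can be cross-checked with a short Python sketch. It standardizes the observed value using the reported expectation and variance, and it assumes a one-sided (upper-tail) normal test, which the text does not state explicitly. The function name is hypothetical.

```python
import math


def moran_z_and_p(observed, expectation, variance):
    """Standardize Moran's I and return (z, one-sided upper-tail p-value)."""
    z = (observed - expectation) / math.sqrt(variance)
    # Upper-tail probability of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p


# Values reported in the text for the contiguity weight matrix.
z, p = moran_z_and_p(0.117, -0.0051, 0.00472)
print(round(z, 2), round(p, 3))
```

Under these assumptions the standardized value comes out close to the reported 1.77 and the p-value close to the reported 0.038, which suggests that the figure of 1.77 is the standard deviate of Moran's I.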

Classification of hot spot and cold spot areas
As the ASMRs in the municipalities of the country have random spatial distributions, Getis-Ord Gi* of Eq (7) was used to identify hot spot and cold spot areas. The 250 regions were divided by z value into seven classes: three hot spot classes, three cold spot classes, and one class for other areas. The breaks for the z value were min, -2.58, -1.96, -1.65, not significant, 1.65, 1.96, 2.58, and max. In addition, the clustered regions were classified into three categories: z = 1.64, z = 1.96, and z = 2.58. Table 3 identifies areas with a z value of ±1.96 or beyond as 10 hot spot areas and 9 cold spot areas. Many of the hot spot areas were inland, and many of the cold spot areas were on the southwestern coast or in the surrounding coastal areas.
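The seven-class scheme described above can be sketched in Python as follows. The class labels, the pairing of each break with a 90%, 95%, or 99% confidence level, and the treatment of values exactly on a break are assumptions for illustration; the text specifies only the break values.

```python
def classify_gi_z(z):
    """Map a Gi* z value to one of seven classes using the stated breaks."""
    if z <= -2.58:
        return "cold spot (99%)"
    if z <= -1.96:
        return "cold spot (95%)"
    if z <= -1.65:
        return "cold spot (90%)"
    if z < 1.65:
        return "not significant"
    if z < 1.96:
        return "hot spot (90%)"
    if z < 2.58:
        return "hot spot (95%)"
    return "hot spot (99%)"


def is_study_spot(z):
    """True when |z| is 1.96 or more, the cutoff used to select areas for Table 3."""
    return abs(z) >= 1.96


# Hypothetical z values, not taken from the paper's data.
for value in (-3.0, -1.7, 0.4, 2.1):
    print(value, classify_gi_z(value), is_study_spot(value))
```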

Definition of dependent and explanatory variables
To compare risk factors in the clustered areas of Korea, dependent and explanatory variables were selected in areas with high and low female CRC incidence. The dependent variable is the CRC ASMR of the two regions. And the explanatory variables are the old

Definition of dependent variables for cluster analysis
The clustering of age-standardized mortality ratio [21] for female colorectal cancer in national administrative districts is conducted to identify the family history and genetic factors

PLOS ONE
mentioned in the introduction. This ASMR for CRC is the age-standardized mortality ratio for CRC (colon, rectal, and anal cancers) per 100,000 people. It is the dependent variable used for visualization and spatial autocorrelation analysis in this paper. Agricultural products such as vegetables rich in magnesium, calcium, starch (glucose), and folic acid (e.g., garlic), as well as protein-rich livestock products such as red meat and processed meat, are important variables because these foods influence the occurrence of CRC, as shown not only by the results of this study but also by previous studies.

Analysis of differences in clustered areas
Ten hot spot regions and nine cold spot regions in Table 3 are compiled and compared by year. Seowon-gu, Cheongju-si was established in July 2014, so it has no data from 2010 to 2013. The hot spot area therefore used 86 frames (90 minus the 4 missing frames for Seowon-gu, Cheongju-si), and the cold spot area used 81 frames. Table 5 shows the descriptive statistics for the hot and cold spot areas. Table 6 shows the results of the F-test of whether the two clustered regions satisfy homoscedasticity. In this table, Var. Equal is true or false depending on whether the significance probability of the F-test for each variable was 0.05 or more. If true, the two-sample t-test was run; if false, the Welch two-sample t-test was run.
Based on the homoscedasticity results in Table 6, the independent-samples t-test in Table 7 determines whether the means differ between the two regions.
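The two t-test variants chosen by the F-test result can be sketched with the Python standard library. The sketch computes only the test statistics and degrees of freedom; p-values would need an F or t distribution function, which is left out here. The function names and sample values are hypothetical and are not drawn from the S1 File.

```python
import math
from statistics import mean, variance


def f_ratio(a, b):
    """Ratio of sample variances, the statistic behind the F-test of equal variances."""
    return variance(a) / variance(b)


def pooled_t(a, b):
    """Two-sample t statistic and df, assuming equal variances."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite df, for unequal variances."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical samples standing in for one variable in the two regions.
hot = [1.0, 2.0, 3.0, 4.0, 5.0]
cold = [2.0, 4.0, 6.0, 8.0, 10.0]
print(f_ratio(hot, cold), pooled_t(hot, cold), welch_t(hot, cold))
```

When the two samples have equal sizes, the two t statistics coincide and only the degrees of freedom differ, which is why the choice of test mainly affects the p-value.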
As shown in Table 7, the two regions had significant mean differences in CRC ASMR, aging rate, CRC mortality, smokers of 66 years old, drinking rate, abdominal obesities of 66 years old, ASMR for diabetes, CRC ASIR, intermediate consumption amount, seafood production, and livestock product shipments. However, there are no significant differences in the aged population, ASMR for breast cancer, or agricultural product shipments between the two regions. Among the variables with significant differences, CRC ASMR, CRC mortality, smokers of 66 years old, abdominal obesities of 66 years old, ASMR for diabetes, and CRC ASIR had higher averages in the hot spot area. On the other hand, the aging rate, drinking rate, intermediate consumption amount, seafood production, and livestock product shipments were higher in the cold spot area.

Table 5. This table gives the statistics for the spot areas in S11 Dataset in S1 File, and the names of the areas are mentioned in Table 3. https://doi.org/10.1371/journal.pone.0273995.t005

Table 7. This table shows the results of estimating the significant mean differences between hot and cold spots using S11 Dataset in S1 File. * t measures the size of the difference relative to the variation in the sample data. ** df is the degrees of freedom, the amount of information in the data; this value is determined by the number of observations. https://doi.org/10.1371/journal.pone.0273995.t007

Descriptive statistical analysis for regression model
Table 8 presents the summary statistics of the regression model for the hot and cold spot areas used in this study. The model has 167 observations, and the dependent and explanatory variables are defined in Table 8.

Diagnosis for regression model
Multicollinearity test. Table 8 shows the results of this test to examine the linear correlation of explanatory variables. Since the largest value of the variance inflation factor (VIF) is 6.55 and all tolerance values are greater than 0.1, it was determined that multicollinearity is not a problem.
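As a brief worked check of the multicollinearity criterion, the standard definitions of the variance inflation factor and tolerance can be applied to the largest reported VIF. The symbol R_j^2 (the coefficient of determination from regressing explanatory variable j on the others) is introduced here for illustration and does not appear in the text.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \qquad
\mathrm{Tolerance}_j = \frac{1}{\mathrm{VIF}_j} = 1 - R_j^2

\mathrm{VIF}_{\max} = 6.55
\;\Rightarrow\;
\mathrm{Tolerance}_{\min} = \frac{1}{6.55} \approx 0.153 > 0.1,
\qquad R_j^2 \approx 0.847
```

The smallest tolerance implied by the largest VIF is therefore still above the 0.1 threshold used in this test.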
Homogeneity test. To diagnose homoscedasticity under the null hypothesis, three tests (Kruskal-Wallis rank sum, Wilcoxon rank sum, and Fligner-Killeen) are conducted, and the results are shown in Table 9. Homoscedasticity holds because the p-values, 0.1745, 0.1745, and 0.1834 respectively, are all greater than 0.05.
Autocorrelation test. The Durbin-Watson test checked whether the residuals were autocorrelated. The autocorrelation is -0.0328, the DW statistic is 2.050599, and the p-value is 0.766, which is not significant. Since the DW statistic is close to 2, there is no autocorrelation.

Analysis result of influence factors
The analysis results of the regression model are shown in Table 10; the model has 30.6% explanatory power according to the adjusted coefficient of determination (adjusted R²). In addition, since the p-value is less than 0.05 and the F-statistic is 6.619, the model is significant. The residual statistics indicate an approximately normal distribution of the residuals, with a median of -0.211, slightly skewed to the left. Also, the size of 1Q is slightly larger than that of 3Q. Nevertheless, the distribution is approximately symmetrical and is not greatly skewed to one side. The female colorectal cancer ASMR in the hot spot and cold spot regions was significantly affected by the number of female colorectal cancer deaths (99.9% CI), diabetes (95% CI), female colorectal cancer ASIR (95% CI), and the aged population (99.9% CI). The regression equation for this model is Eq (2). This leads to the prediction that as CRC mortality, ASMR for diabetes, and CRC ASIR increase, the ASMR for female colorectal cancer will increase. In addition, as the aged population increased, the standardized mortality ratio for colorectal cancer decreased. This suggests that as the aged population increases, the ASMR of colorectal cancer per 100,000 people will decrease further.

$$y = 4.917 \times 10^{-1}\,(\text{CRC mortality}) + 1.700 \times 10^{-1}\,(\text{ASMR for diabetes}) + 1.400 \times 10^{-1}\,(\text{CRC ASIR}) - 5.650 \times 10^{-1}\,(\text{Aged population}) \tag{2}$$
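The direction of each effect in the regression equation can be illustrated with a small Python sketch that evaluates the linear combination using the four reported coefficients. The dictionary keys and input values are hypothetical; the units and scaling of the variables are defined in Table 10, which is not reproduced here, and no intercept is included because none is given in the equation.

```python
# Coefficients as reported for the regression equation (Eq 2).
COEFFICIENTS = {
    "crc_mortality": 0.4917,
    "asmr_diabetes": 0.1700,
    "crc_asir": 0.1400,
    "aged_population": -0.5650,
}


def predict_asmr(**values):
    """Evaluate the linear combination for one observation."""
    return sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)


# Hypothetical inputs: raising only the aged population lowers the prediction.
base = predict_asmr(crc_mortality=1.0, asmr_diabetes=1.0, crc_asir=1.0, aged_population=1.0)
older = predict_asmr(crc_mortality=1.0, asmr_diabetes=1.0, crc_asir=1.0, aged_population=2.0)
print(base, older)
```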

Discussion
In previous studies, women's age, abdominal obesity, meat, fish, and vegetable food, breast cancer, and diabetes were factors influencing CRC. According to some studies, there is rarely a link between CRC and alcohol consumption. This study shows that breast cancer and vegetable food had no effect on CRC ASMR. In addition, smoking and drinking, abdominal obesity, breast cancer, and meat, fish, and vegetable food hardly affected CRC ASMR in Table 10. Diabetes, by contrast, was an important factor influencing CRC ASMR, as was commonly found in previous studies. This study also showed relatively significant differences between the two regions in aging rate, CRC mortality, smokers, drinking rate, abdominal obesity, diabetes, CRC ASIR, consumption amount, seafood production, and livestock products, but the aged population, CRC mortality, diabetes, and CRC ASIR have the most important influence.
This study had to deal with differences between the study setting and the results. First, as a limitation of data construction, an administrative area changed and some data had to be estimated. The areas of Cheongju and Suwon among the administrative districts in the hot spot area were analyzed on a nine-year basis using estimates. Second, data on food consumed by all residents were compared because exact data on consumption by the population aged 65 or older in the hot spot and cold spot areas were not available.

Conclusion
Diabetes is a common factor of CRC in existing studies, and this research confirms it. This research can be used as a foundation for diabetes prevention. In addition, the difference between the hot spot and the cold spot could be used as a mediation factor for the cluster in the future. Due to the limitations of the sizes provided by KOSIS, the SD values of smoker, agricultural product shipments, seafood production, and livestock product shipments are quite small compared to the means. The findings of this study on the relation between the SD and the mean support the common trend of the mean increasing as the sampling distribution decreases.
The study found that diabetes was the most common factor in CRC using two methods: regression analysis of available factors and the comparison of z values in the hot spot and cold spot areas. The age standardized mortality ratio in southwestern coastal areas or in the surrounding coastal areas is lower than in the other areas of South Korea.

At the 95% confidence interval (CI), a z value lower than -1.96 or higher than +1.96 indicates that the distribution is not random [36]. The nearest-neighbour index is:

$$R = \frac{d_o}{d_E} \tag{3}$$

where R is the index, d_o is the observed mean nearest neighbor distance, and d_E is the expected mean nearest neighbor distance for a random disposition of points:

$$d_E = \frac{1}{2\sqrt{p}} \tag{4}$$

where p is the number of points divided by the area. A test of significance is provided by:

$$z = \frac{d_o - d_E}{SE_{d_o}} \tag{5}$$

where z denotes the standard normal deviate (the sampling distribution is normal) and SE_{d_o} denotes the standard error of the mean nearest-neighbour distance [35].

The Moran's I index. The existence of global autocorrelation can be quantified and generalized from the distribution pattern of individuals in the region through Moran's I index [37]. The Moran's I index is given by Eq (6):

$$I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{6}$$

where n is the number of observations (unit areas) and w_{ij} is the spatial weight between areas i and j. In Eq (6), w_{ij}, a component of the spatial weight matrix w, is assigned based on whether unit regions i and j share a boundary: it is 0 if they are not adjacent and 1 if they are neighbors. The value of spatial region i is x_i, and \bar{x} is the average value of variable x. The Moran's I index represents a spatial cluster of heterogeneous attributes as it approaches -1, and a cluster of similar attributes as it approaches 1. If the global Moran index is 0, there is no spatial autocorrelation.
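The definition of global Moran's I with binary contiguity weights (1 for areas that share a boundary, 0 otherwise) can be illustrated with a minimal Python sketch. The function name and the four-area example are hypothetical; the study itself used R on 250 municipalities.

```python
def morans_i(values, weights):
    """Global Moran's I for attribute values and a square spatial weight matrix."""
    n = len(values)
    x_bar = sum(values) / n
    dev = [x - x_bar for x in values]
    # Sum of all spatial weights.
    w_sum = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between every pair of areas.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / w_sum) * (cross / sum(d * d for d in dev))


# Hypothetical chain of four areas: 0-1, 1-2 and 2-3 share a boundary.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(morans_i([1, 2, 3, 4], W))    # similar neighbors: positive value
print(morans_i([1, -1, 1, -1], W))  # alternating neighbors: negative value
```

In this toy layout, smoothly increasing values give a positive index and strictly alternating values give -1, matching the interpretation of the index near 1 and -1.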
The characteristics of Seoul's building supply and changes in spatial clustering patterns were studied. Methods for analyzing the distribution pattern of spatial phenomena are global and local cluster pattern analysis. Global cluster analysis uses Getis-Ord General G (General G) or Moran's I, and local cluster analysis uses the LISA index or Getis-Ord Gi* (Gi*). Hot spot analysis uses General G and Gi* [38].
Gi* calculates the z-value and p-value of each spatial unit in the target area as in Eq (7). This makes it possible to identify statistically significant clustering of high attribute values (hot spots) and of low attribute values (cold spots). It also has the advantage of allowing the analysis to be plotted by spatial unit [38].
In Eq (7), i and j are units of analysis, x_i and x_j are the attribute data of areas i and j, w_ij is the spatial weight between areas i and j, n is the number of units of analysis, and WSD is the weighted standard distance.

The Durbin-Watson test
The Durbin-Watson test is used to determine whether the residuals satisfy the assumption of independence and whether autocorrelation exists. As in Eq (8), the statistic of this test is called d, and e_t is the residual remaining after prediction:

$$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2} \tag{8}$$

where e_t is the residual and T is the number of observations. When the value of d is close to 0 or close to 4, autocorrelation exists between the residuals. Independence should be maintained between the residuals, and when there is no significant relationship between them, d should be close to 2 [39].
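The Durbin-Watson statistic described above, the sum of squared successive differences of the residuals divided by the sum of squared residuals, can be sketched in a few lines of Python. The function name and the residual sequences are hypothetical and serve only to show the two extremes of the statistic.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e * e for e in residuals)
    return num / den


# Hypothetical residuals: identical values push d toward 0 (positive
# autocorrelation), alternating signs push d toward 4 (negative autocorrelation).
print(durbin_watson([1.0, 1.0, 1.0, 1.0]))
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))
```

Values near 2, such as the 2.05 reported for this model, fall between these two extremes and are read as no autocorrelation.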
Supporting information S1 File. Supplementary for the dataset in the manuscript. (DOCX)